Day 45 - Beautiful Soup, Web Scraping & robots.txt


Posted by pei_______ on 2022-05-26

Learning from 100 Days of Code: The Complete Python Pro Bootcamp for 2022


Beautiful Soup Documentation


robots.txt

A plain-text file that follows the Robots Exclusion Standard and contains one or more rules.

These rules block (or allow) specific crawlers from accessing particular file paths on the site.

Unless you specify otherwise in the robots.txt file, all files are allowed to be crawled.

Here is a simple robots.txt file with two rules:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
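
You can also check these rules from Python before scraping. A minimal sketch using the standard library's urllib.robotparser, run against the example file above (example.com is a placeholder):

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt (example.com is a placeholder)
rp = RobotFileParser('http://www.example.com/robots.txt')
rp.read()

# Ask whether a given user agent may fetch a given path
print(rp.can_fetch('Googlebot', 'http://www.example.com/nogooglebot/page'))  # False
print(rp.can_fetch('*', 'http://www.example.com/'))  # True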

Basic Beautiful Soup Syntax

Importing the tools & loading the page data

  1. If the site blocks requests sent by python requests, you need to supply a custom header
  2. HTTP HEADER
from bs4 import BeautifulSoup
import requests

# HTML from a local file on disk
with open('website.html', encoding="utf-8") as file:
    contents = file.read()

# HTML fetched online (URL is a placeholder for the page to scrape)
URL = "http://www.example.com"
header = {
    "User-Agent": "537.36 (KHTML, like Gecko) Chrome",
    "Accept-Language": "zh-TW"
}

response = requests.get(url=URL, headers=header)
html_content = response.text

Printing the page's HTML

soup = BeautifulSoup(contents, 'html.parser')
html_code_beautified = soup.prettify()

Printing "all" of the matching elements on the page

find_all_item = soup.find_all(name='h3')
select_item = soup.select('h3')

'''
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''
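
The difference: find_all filters by tag name and keyword attributes, while select takes a CSS selector. Both can narrow a match by class; against the sample page above, either of these sketches should return only the two class="capital" headings:

find_all_capitals = soup.find_all(name='h3', class_='capital')
select_capitals = soup.select('h3.capital')

'''
[<h3 class="capital">FIRST_DIV</h3>, <h3 class="capital">SECOND_DIV</h3>]
[<h3 class="capital">FIRST_DIV</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''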

Printing the "first" matching element on the page

find_item = soup.find(name='h3', class_='capital')
select_one_item = soup.select_one('.first_div h3')
in_h3_tag = soup.h3

'''
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
'''

Printing the text inside a tag

print(select_one_item.string)
print(select_one_item.text)
print(select_one_item.getText())

'''
FIRST_DIV
FIRST_DIV
FIRST_DIV
'''
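
All three print the same thing here because the tag holds a single text node. They behave differently on nested markup: .string returns None when a tag has more than one child, while .text and .getText() join all descendant text. A quick sketch with a made-up snippet:

nested = BeautifulSoup('<div><h3>FIRST</h3><h3>SECOND</h3></div>', 'html.parser')

print(nested.div.string)  # None - the div holds more than one child
print(nested.div.text)    # FIRSTSECOND - all descendant text joined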

Printing a tag's attributes

  • Also useful for printing the link stored in an 'href' attribute, as shown in the sketch below
print(select_one_item.get('class'))

'''
['capital']
'''
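
The same .get() works for any attribute. A minimal sketch, assuming the page has at least one anchor tag:

first_link = soup.select_one('a')  # hypothetical: the first <a> tag on the page
print(first_link.get('href'))      # the URL the link points to
print(first_link.attrs)            # dict of every attribute on the tag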

Scraping the most-upvoted story from Hacker News (Y Combinator)

from bs4 import BeautifulSoup
import requests

response = requests.get('https://news.ycombinator.com/')
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, 'html.parser')

# Every story title is an <a class="titlelink"> inside a <td class="title">
articles = soup.select('.title .titlelink')
article_strings = []
article_links = []

for article in articles:
    article_string = article.getText()
    article_strings.append(article_string)
    article_link = article.get('href')
    article_links.append(article_link)

# Each score reads like "123 points"; keep just the number
article_upvotes = [int(score.getText().split()[0]) for score in soup.find_all(name='span', class_='score')]
highest_score = max(article_upvotes)
largest_index = article_upvotes.index(highest_score)

print(article_strings[largest_index])
print(article_links[largest_index])
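
One caveat with the two-list approach: job postings on the front page have a title but no score, so the title and score lists can drift out of alignment. A more robust sketch, assuming HN's markup where each story sits in a <tr class="athing"> and its score (if any) lives in the following row:

for row in soup.select('.athing'):
    title_tag = row.select_one('.titlelink')
    score_tag = row.find_next_sibling('tr').select_one('.score')
    upvotes = int(score_tag.getText().split()[0]) if score_tag else 0
    print(upvotes, title_tag.getText(), title_tag.get('href'))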

Scraping the top 100 movies

import requests
from bs4 import BeautifulSoup

URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(url=URL)
website_html = response.text
soup = BeautifulSoup(website_html, 'html.parser')

# Grab every movie title; the page lists them from #100 down to #1,
# so reverse to get ascending order
movies = [title.getText() for title in soup.select('.article-title-description__text .title')]
movies = movies[::-1]

# Write one title per line
with open('movie.txt', 'w', encoding='UTF-8') as file:
    for movie in movies:
        file.write(f"{movie}\n")

#Python #Course Notes #100 Days of Code






